Safe Policy Iteration
Authors
Abstract
CONTRIBUTIONS
1. Theoretical contribution. We introduce a new, more general lower bound on the policy improvement of an arbitrary policy over another policy, based on the ability to bound the distance between their future state distributions.
2. Algorithmic contribution. We define two approximate policy-iteration algorithms whose policy-improvement step moves toward the estimated greedy policy by maximizing the policy-improvement bounds.
3. Empirical contribution. We report results on simple chain-walk and BlackJack domains that confirm the main theoretical findings.
PROBLEM
• Classical API approaches may generate a policy πt+1 that performs worse than the previous policy πt.
• This undesired behavior may lead to the policy-oscillation phenomenon, which can prevent convergence to the optimal policy and degrade the learning process.
• Our "safe" approach addresses this issue by visiting a sequence of policies with monotonically improving performance. Under this approach, the policy is constrained to improve over time and, as a consequence, degradation of the policy's performance between consecutive iterations is prevented.
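The conservative update described above can be sketched in code. The example below is a minimal illustrative version of safe (conservative) policy iteration on a toy chain-walk MDP, not the paper's exact algorithm: the MDP layout, the CPI-style step size α, and all constants are assumptions made for illustration. The key idea it shows is the safe-update pattern — mix the current policy toward the greedy policy with a small α so that performance is guaranteed not to degrade.

```python
import numpy as np

# Toy chain-walk MDP (illustrative, not the paper's exact benchmark).
N, GAMMA = 5, 0.9
A = 2  # actions: 0 = left, 1 = right
# P[a, s, s'] transition probabilities; R[s, a] rewards.
P = np.zeros((A, N, N))
for s in range(N):
    P[0, s, max(s - 1, 0)] = 1.0      # move left (walls are absorbing edges)
    P[1, s, min(s + 1, N - 1)] = 1.0  # move right
R = np.zeros((N, A))
R[N - 2, 1] = 1.0  # reward for stepping right out of state N-2

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,asn->sn', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(N) - GAMMA * P_pi, r_pi)
    Q = R + GAMMA * np.einsum('asn,n->sa', P, V)
    return V, Q

pi = np.full((N, A), 0.5)  # start from the uniform policy
for _ in range(50):
    V, Q = evaluate(pi)
    greedy = np.eye(A)[Q.argmax(axis=1)]  # target greedy policy
    adv = (greedy * Q).sum(axis=1) - V    # per-state advantage of greedy over pi (>= 0)
    if adv.max() < 1e-12:
        break
    # Conservative mixture update: pi <- (1 - alpha) * pi + alpha * greedy.
    # This step size is an illustrative CPI-style choice (Kakade & Langford),
    # not the paper's bound-maximizing alpha.
    alpha = min(1.0, (1 - GAMMA) * adv.mean() / max(adv.max(), 1e-12))
    new_pi = (1 - alpha) * pi + alpha * greedy
    # Safety check: with exact evaluation, mixing toward the greedy policy
    # can never decrease the value (policy improvement theorem).
    assert evaluate(new_pi)[0].sum() >= V.sum() - 1e-9
    pi = new_pi
```

Because evaluation here is exact, any α in (0, 1] already guarantees monotone improvement; the small bound-driven α matters precisely in the approximate setting the paper targets, where the greedy policy is only estimated.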
Similar papers
Policy Iteration in Finite Templates Domain
We prove in this paper that policy iteration can be defined generally in a finite domain of templates using Lagrange duality. Such a policy iteration algorithm converges to a fixed point when a very simple technical condition holds. This fixed point furnishes a safe over-approximation of the set of reachable values taken by the variables of a program. We also prove that policy iteration can be ea...
Nonconvex Policy Search Using Variational Inequalities
Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems such as robotics control. Though successful, current methods can lead to unsafe policy parameters that potentially could damage hardware units. Motivated by such constraints, we propose p...
Parallel Optimization of Motion Controllers via Policy Iteration
This paper describes a policy iteration algorithm for optimizing the performance of a harmonic function-based controller with respect to a user-defined index. Value functions are represented as potential distributions over the problem domain, with control policies represented as gradient fields over the same domain. All intermediate policies are intrinsically safe, i.e. collisions are not prom...
On Controlled Markov Chains with Optimality Requirement and Safety Constraint
We study the control of completely observed Markov chains subject to generalized safety bounds and optimality requirement. Originally, the safety bounds were specified as unit-interval valued vector pairs (lower and upper bounds for each component of the state probability distribution). In this paper, we generalize the constraint to be any linear convex set for the distribution to stay in, and ...
Nested Value Iteration for Partially Satisfiable Co-Safe LTL Specifications
Overview We describe our recent work (Lacerda, Parker, and Hawes 2015) on cost-optimal policy generation for co-safe linear temporal logic (LTL) specifications that are not satisfiable with probability one in a Markov decision process (MDP) model. We provide an overview of the approach to pose the problem as the optimisation of three standard objectives in a trimmed product MDP. Furthermore, we ...